In this notebook we will create an interactive map of the UK which displays the percentage of trans men in each local authority. There are 331 local authorities in the UK, and we are using data collected in the 2021 UK Census, which included two new questions on sexuality and gender identity. We will use two datasets: a gender identity dataset from the census, and a shapefile of local authority boundaries.
If you’re running this code on your own PC (and not through the Binder link) then you’ll want to uncomment the lines below to install the requisite packages. Remember also to set your working directory to the correct folder, otherwise reading in the data will be difficult.
# install.packages("leaflet")
# install.packages("sf")
# install.packages("dplyr")
# install.packages("readr")
# used to read-in datasets
library(readr)
# used to manipulate datasets
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# used to read-in spatial data, shapefiles
library(sf)
## Linking to GEOS 3.9.1, GDAL 3.2.3, PROJ 7.2.1
# used to create interactive maps
library(leaflet)
# First, let's read in our gender identity dataset
df <- read_csv('../Data/GI_det.csv')
## Rows: 2648 Columns: 5
## ── Column specification ──────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Lower tier local authorities Code, Lower tier local authorities, Gender ident...
## dbl (2): Gender identity (8 categories) Code, Observation
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Use the head function to check out the first few rows - but you can also access df via the environment pane
head(df, 10)
## # A tibble: 10 × 5
## `Lower tier local authorities Code` Lower tier local authoriti…¹ Gender identity (8 c…²
## <chr> <chr> <dbl>
## 1 E06000001 Hartlepool -8
## 2 E06000001 Hartlepool 1
## 3 E06000001 Hartlepool 2
## 4 E06000001 Hartlepool 3
## 5 E06000001 Hartlepool 4
## 6 E06000001 Hartlepool 5
## 7 E06000001 Hartlepool 6
## 8 E06000001 Hartlepool 7
## 9 E06000002 Middlesbrough -8
## 10 E06000002 Middlesbrough 1
## # ℹ abbreviated names: ¹​`Lower tier local authorities`,
## # ²​`Gender identity (8 categories) Code`
## # ℹ 2 more variables: `Gender identity (8 categories)` <chr>, Observation <dbl>
Before we can calculate the percentage of trans men in each local authority, it’s good to do some housekeeping and get our dataframe in order.
There are a few things that need sorting, including renaming some unwieldy column names and dropping categories we don’t need.
The pipe operator is used to pass the result of one function directly into the next one. E.g. let’s say we had some code:
sorted_data <- my_data %>% filter(condition) %>% arrange(sorting_variable)
What we’re doing is using the pipe operator to pass my_data to the filter() function, and the result of this is then passed to the arrange() function.
Basically, pipes allow us to chain together a sequence of functions in a way that’s easy to read and understand.
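As a quick illustration, here’s that same pattern on a tiny made-up dataframe (the names and scores are invented purely for demonstration):

```r
library(dplyr)

# A toy dataframe - these values are made up purely to demonstrate the pipe
my_data <- data.frame(name = c("A", "B", "C", "D"),
                      score = c(10, 42, 7, 25))

# Without pipes: nested calls read inside-out
sorted_data <- arrange(filter(my_data, score > 9), score)

# With pipes: the same steps read top-to-bottom
sorted_data <- my_data %>%
  filter(score > 9) %>%
  arrange(score)

sorted_data
##   name score
## 1    A    10
## 2    D    25
## 3    B    42
```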
In the code below we use the pipe operator to pass our dataframe to the rename function.
This supplies the rename function with its first argument: the dataframe whose columns we want to rename.
# Rename columns using the rename function from dplyr
# Specify what you want to rename the column to, and supply the original column string
df <- df %>%
rename(LA_code = `Lower tier local authorities Code`,
# backticks ` necessary when names are syntactically invalid, e.g. spaces, special characters etc.
LA_name = `Lower tier local authorities`,
GI_code = `Gender identity (8 categories) Code`,
GI_cat = `Gender identity (8 categories)`)
# Let's use the colnames function to see if it worked
colnames(df)
## [1] "LA_code" "LA_name" "GI_code" "GI_cat" "Observation"
Logical operators are used to perform comparisons between values or expressions, which result in a logical (Boolean) value of ‘TRUE’ or ‘FALSE’.
In the code below we use the ‘!=’ ‘Does not equal’ operator which tests if the GI_cat value in each row of the df does not equal the string ‘Does not apply’.
For each row where GI_cat is not equal to ‘Does not apply’, the expression evaluates to TRUE.
We filter so we only keep rows where this expression evaluates to TRUE.
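To see the comparison on its own, here’s a toy example (the categories below are just made-up sample values) showing that ‘!=’ compares element-wise and returns a logical vector:

```r
# '!=' compares each element and returns TRUE/FALSE per element
categories <- c("Trans man", "Does not apply", "Non-binary")
categories != "Does not apply"
## [1]  TRUE FALSE  TRUE
```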
# Use dplyr's filter function to get rid of 'Does not apply'
# Use '!=' to keep everything except 'Does not apply' category
df <- df %>% filter(GI_cat != 'Does not apply')
The ‘$’ operator is used to access elements, such as columns of a dataframe, by name. Below, we use it to access the gender identity category column, where we want to view the unique values.
# Unique function can be applied to a column in a df to see which values are in that column
# Let's see if 'Does not apply' has been successfully dropped
unique(df$GI_cat)
## [1] "Gender identity the same as sex registered at birth"
## [2] "Gender identity different from sex registered at birth but no specific identity given"
## [3] "Trans woman"
## [4] "Trans man"
## [5] "Non-binary"
## [6] "All other gender identities"
## [7] "Not answered"
Now onto the more interesting stuff. The data pre-processing stage involves preparing and transforming data into a suitable format for further analysis. It can involve selecting features, transforming variables, and creating new variables. For our purposes, we need to create a new column ‘Percentage’ which contains the percentage of trans men in each local authority.
So, we’ll need to first calculate the % of each gender identity category for each local authority. Then, we’ll want to filter our dataset so that we only keep the responses related to Trans men.
# Use group_by to group the dataframe by the LA_name column
# Use mutate to perform calculation within each LA_name group, convert result to a % by multiplying by 100
# round() is used to round %'s to 2 decimal places
df <- df %>%
group_by(LA_name) %>%
mutate(Percentage = round(Observation / sum(Observation) * 100, 2))
# Let's check out the results
head(df, 10)
## # A tibble: 10 × 6
## # Groups: LA_name [2]
## LA_code LA_name GI_code GI_cat Observation Percentage
## <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 E06000001 Hartlepool 1 Gender identity the same as sex… 70588 94.5
## 2 E06000001 Hartlepool 2 Gender identity different from … 167 0.22
## 3 E06000001 Hartlepool 3 Trans woman 49 0.07
## 4 E06000001 Hartlepool 4 Trans man 51 0.07
## 5 E06000001 Hartlepool 5 Non-binary 33 0.04
## 6 E06000001 Hartlepool 6 All other gender identities 21 0.03
## 7 E06000001 Hartlepool 7 Not answered 3777 5.06
## 8 E06000002 Middlesbrough 1 Gender identity the same as sex… 106009 93.0
## 9 E06000002 Middlesbrough 2 Gender identity different from … 496 0.44
## 10 E06000002 Middlesbrough 3 Trans woman 141 0.12
# Use filter() to only keep rows where GI_cat equals 'Trans man'
df <- df %>%
filter(GI_cat == 'Trans man') %>%
# Use select() with '-' to remove 'Observation' column
select(-Observation) %>%
# Use distinct() to remove duplicate rows, as a precaution
distinct() %>%
# Use ungroup() to remove grouping - resetting the dataframe's state after grouped operations is good practice
ungroup()
# Let's take a look at the results
head(df)
## # A tibble: 6 × 5
## LA_code LA_name GI_code GI_cat Percentage
## <chr> <chr> <dbl> <chr> <dbl>
## 1 E06000001 Hartlepool 4 Trans man 0.07
## 2 E06000002 Middlesbrough 4 Trans man 0.15
## 3 E06000003 Redcar and Cleveland 4 Trans man 0.09
## 4 E06000004 Stockton-on-Tees 4 Trans man 0.08
## 5 E06000005 Darlington 4 Trans man 0.08
## 6 E06000006 Halton 4 Trans man 0.08
Now that we have our gender identity dataset sorted, we can start on the mapping process. And that starts with reading in our shapefile, which we should have downloaded from the geoportal. If (like me) you don’t work with spatial data much, you might assume that you only need the .shp file, and delete the others that come in the folder. However, a shapefile is not a single .shp file but a collection of files that work together, and each plays a crucial role in defining the shapefile’s data and behaviour. When you try to read a shapefile into R, the software expects all components to be present, and missing ones can lead to errors or incorrect spatial references. For example, without the .dbf file you’d lose all attribute data associated with the geographic features, and without the .shx file you might not be able to read the .shp file at all.
TLDR: Make sure when you download the shapefile folder you keep all the files!
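If you want to double-check before reading anything in, you can list the files sitting alongside the .shp. The folder path below is just an example - substitute wherever you saved your download:

```r
# Sanity check (optional): list every file that makes up the shapefile
# The path here is an example - point it at your own shapefile folder
shp_dir <- 'Shapefiles/LADS_May_2022'
list.files(shp_dir, pattern = 'LAD_MAY_2022_UK_BGC_V3')
# You should see at least the .shp (geometry), .shx (index),
# .dbf (attributes) and .prj (projection) files listed together
```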
Anyway, let’s get started.
# Read in shapefile to a simple features object
# st_read() reads in spatial data to a 'simple features' object
sf <- st_read('/Users/loucap/Documents/GitWork/Interactive_vis_new/Shapefiles/LADS_May_2022/LAD_MAY_2022_UK_BGC_V3.shp')
## Reading layer `LAD_MAY_2022_UK_BGC_V3' from data source
## `/Users/loucap/Documents/GitWork/Interactive_vis_new/Shapefiles/LADS_May_2022/LAD_MAY_2022_UK_BGC_V3.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 374 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -116.1928 ymin: 5352.6 xmax: 655653.8 ymax: 1220299
## Projected CRS: OSGB 1936 / British National Grid
# Let's check it out
head(sf)
## Simple feature collection with 6 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 344666.1 ymin: 378867 xmax: 478433.4 ymax: 537152
## Projected CRS: OSGB 1936 / British National Grid
## LAD22CD LAD22NM BNG_E BNG_N LONG LAT
## 1 E06000001 Hartlepool 447160 531474 -1.27018 54.6761
## 2 E06000002 Middlesbrough 451141 516887 -1.21099 54.5447
## 3 E06000003 Redcar and Cleveland 464361 519597 -1.00608 54.5675
## 4 E06000004 Stockton-on-Tees 444940 518183 -1.30664 54.5569
## 5 E06000005 Darlington 428029 515648 -1.56835 54.5353
## 6 E06000006 Halton 354246 382146 -2.68853 53.3342
## GlobalID geometry
## 1 2f400ad8-2901-4ac7-b323-2de4588b680e MULTIPOLYGON (((449812.9 52...
## 2 567ad05a-69c2-4063-b76b-1a8f661a8c03 MULTIPOLYGON (((446860 5172...
## 3 851f8a8c-3a68-441d-a687-8c91ca574ca1 MULTIPOLYGON (((451747.4 52...
## 4 166e0146-3a5c-4cda-95f4-68893c9d063d MULTIPOLYGON (((446997.1 51...
## 5 0ebbe705-9e0e-44fc-939d-e8839205db57 MULTIPOLYGON (((423475.7 52...
## 6 522a702c-ca2c-4009-b93a-b96e00de165e MULTIPOLYGON (((358374.7 38...
# Better to just view via environment pane
# Inspect dimensions
dim(sf)
## [1] 374 8
# length() with the unique() function gives us the number of unique values in a column
length(unique(sf$LAD22NM))
## [1] 374
Hmm. We have 331 local authorities in our dataset that we want to plot, but there are 374 listed here. We’ll need to remove the local authorities that don’t match the ones in our df.
# Use rename function so sf columns match those in original df
sf <- sf %>%
rename(LA_code = LAD22CD,
LA_name = LAD22NM)
# Let's see if it worked
colnames(sf)
## [1] "LA_code" "LA_name" "BNG_E" "BNG_N" "LONG" "LAT" "GlobalID"
## [8] "geometry"
# Replace specific values in the LA_name column using recode()
sf$LA_name <- sf$LA_name %>%
recode(`Bristol, City of` = "Bristol",
`Kingston upon Hull, City of` = "Kingston upon Hull",
`Herefordshire, County of` = "Herefordshire")
The ‘%in%’ operator is used to check if the elements of one vector are present in another. Much like the logical operators, it returns a Boolean value, TRUE or FALSE. Below, we only keep rows of the ‘sf’ dataset whose LA_code is present in the LA_code column of ‘df’.
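Here’s a toy illustration (the codes below are sample values, with one deliberately fake one) of what ‘%in%’ returns:

```r
# %in% checks each element of the left vector against the right vector
codes <- c("E06000001", "E06000002", "E06000099")
keep  <- c("E06000001", "E06000002")
codes %in% keep
## [1]  TRUE  TRUE FALSE
```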
# Use filter() with %in% and unique() to only keep LA's that match
sf <- sf %>%
filter(LA_code %in% unique(df$LA_code))
# Let's see how it looks..
# We should have 331 unique LA_codes
length(unique(sf$LA_code))
## [1] 331
When it comes to mapping our data, it is important to know which Coordinate Reference System (CRS) we are working with. Simply put, the CRS describes how the spatial data in the ‘sf’ object maps to locations on Earth: it is a way of translating 3D reality into 2D maps. When using mapping libraries like ‘leaflet’, knowing the CRS matters because leaflet expects coordinates in a specific format (latitude and longitude, i.e. EPSG:4326). If our CRS isn’t in this format then we need to transform it to match what leaflet expects. Let’s go ahead and see what our CRS is.
# st_crs() shows our CRS info
st_crs(sf)
## Coordinate Reference System:
## User input: OSGB 1936 / British National Grid
## wkt:
## PROJCRS["OSGB 1936 / British National Grid",
## BASEGEOGCRS["OSGB 1936",
## DATUM["OSGB 1936",
## ELLIPSOID["Airy 1830",6377563.396,299.3249646,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4277]],
## CONVERSION["British National Grid",
## METHOD["Transverse Mercator",
## ID["EPSG",9807]],
## PARAMETER["Latitude of natural origin",49,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8801]],
## PARAMETER["Longitude of natural origin",-2,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8802]],
## PARAMETER["Scale factor at natural origin",0.9996012717,
## SCALEUNIT["unity",1],
## ID["EPSG",8805]],
## PARAMETER["False easting",400000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8806]],
## PARAMETER["False northing",-100000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8807]]],
## CS[Cartesian,2],
## AXIS["(E)",east,
## ORDER[1],
## LENGTHUNIT["metre",1]],
## AXIS["(N)",north,
## ORDER[2],
## LENGTHUNIT["metre",1]],
## USAGE[
## SCOPE["Engineering survey, topographic mapping."],
## AREA["United Kingdom (UK) - offshore to boundary of UKCS within 49°45'N to 61°N and 9°W to 2°E; onshore Great Britain (England, Wales and Scotland). Isle of Man onshore."],
## BBOX[49.75,-9,61.01,2.01]],
## ID["EPSG",27700]]
# To transform our crs to EPSG: 4326, simply use st_transform() and specify the crs
# Note: you don't have to use the %>% pipe operator all the time
sf <- st_transform(sf, crs = 4326)
What we want to do now is merge our ‘df’ dataframe with our ‘sf’ spatial object, so that we can directly access the data and map it!
When you use the merge function in R, the order of the arguments matters for the result’s class and spatial attributes. Here we are merging a dataframe with a spatial object. By placing ‘sf’ first, the result will be a spatial object, which is important because it retains the spatial characteristics and geometry column of the ‘sf’ object. We merge on the LA_code and LA_name columns, which are present in both datasets.
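As a tiny sketch of why the order matters (the two-point geometry and id/value columns here are invented purely for illustration, and assume the sf package is loaded):

```r
library(sf)

# A minimal made-up sf object: two points with an id column
toy_sf <- st_sf(id = c(1, 2),
                geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 1))))
toy_df <- data.frame(id = c(1, 2), value = c("a", "b"))

# sf object first: the result keeps the sf class and its geometry
class(merge(toy_sf, toy_df, by = "id"))
# dataframe first: the result is a plain data.frame
class(merge(toy_df, toy_sf, by = "id"))
```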
The c() function simply combines values together. Don’t overthink it: it’s just a way to group items together in R, whether for defining a set of values to work with, specifying arguments to a function, or any number of other uses where a list of items is needed.
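For example, c() is how we’ll hand merge() the two column names to join on, and it works just the same for numbers:

```r
# c() combines values into a vector
c('LA_code', 'LA_name')
## [1] "LA_code" "LA_name"

# Operations on vectors apply element-wise
c(1, 2, 3) + 10
## [1] 11 12 13
```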
# Merge the dataframes
merged <- merge(sf, df, by = c('LA_code', 'LA_name'))
# Let's check it out
head(merged)
## Simple feature collection with 6 features and 10 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -2.832457 ymin: 53.30502 xmax: -0.788609 ymax: 54.72716
## Geodetic CRS: WGS 84
## LA_code LA_name BNG_E BNG_N LONG LAT
## 1 E06000001 Hartlepool 447160 531474 -1.27018 54.6761
## 2 E06000002 Middlesbrough 451141 516887 -1.21099 54.5447
## 3 E06000003 Redcar and Cleveland 464361 519597 -1.00608 54.5675
## 4 E06000004 Stockton-on-Tees 444940 518183 -1.30664 54.5569
## 5 E06000005 Darlington 428029 515648 -1.56835 54.5353
## 6 E06000006 Halton 354246 382146 -2.68853 53.3342
## GlobalID GI_code GI_cat Percentage
## 1 2f400ad8-2901-4ac7-b323-2de4588b680e 4 Trans man 0.07
## 2 567ad05a-69c2-4063-b76b-1a8f661a8c03 4 Trans man 0.15
## 3 851f8a8c-3a68-441d-a687-8c91ca574ca1 4 Trans man 0.09
## 4 166e0146-3a5c-4cda-95f4-68893c9d063d 4 Trans man 0.08
## 5 0ebbe705-9e0e-44fc-939d-e8839205db57 4 Trans man 0.08
## 6 522a702c-ca2c-4009-b93a-b96e00de165e 4 Trans man 0.08
## geometry
## 1 MULTIPOLYGON (((-1.23 54.62...
## 2 MULTIPOLYGON (((-1.277103 5...
## 3 MULTIPOLYGON (((-1.200967 5...
## 4 MULTIPOLYGON (((-1.274913 5...
## 5 MULTIPOLYGON (((-1.637991 5...
## 6 MULTIPOLYGON (((-2.626836 5...
Finally, we can now build our interactive map using leaflet. You can see from the ‘geometry’ column that we’re working with MULTIPOLYGONs. Multipolygons are a collection of polygons grouped together as a single geometric entity, which makes them good at representing complex shapes; we also have some standard polygons. In total we have 331 shapes to plot, each representing a local authority. You can take a look at an individual shape by using the plot function and indexing the row and column (see below).
plot(sf[1, 'geometry'])
The code below has helpful comments that should help you grasp what each bit is doing. But, to give the overall picture: first we define a colour palette, which creates a colour scale for the range of values in our ‘Percentage’ column. Then we create our interactive map, named ‘uk_map’. We centre the map, add some default map tiles, add our polygons and colour them, then add the interactive elements such as highlight options (how a shape changes when the cursor hovers over it) and labels (which specify the tooltips). Then we add a legend. Finally, we display the interactive map.
# Define the color palette for filling in our multipolygon shapes
# domain sets the range of data values that the colour scale should cover
color_palette <- colorNumeric(palette = "YlGnBu", domain = merged$Percentage)
# Use leaflet function with 'merged' dataset
uk_map <- leaflet(merged) %>%
# Centers the map on long and lat for UK
setView(lng = -3.0, lat = 53, zoom = 6) %>%
# Adds default map tiles (the visual image of the map)
addTiles() %>%
# Adds multipolygons to the map, and colours them based on the 'Percentage' column
# We use the palette we created above
addPolygons(
fillColor = ~color_palette(Percentage),
weight = 1, # Set the border weight to 1 for thinner borders
color = "#000000",
fillOpacity = 0.7,
highlightOptions = highlightOptions(color = "white", weight = 2, bringToFront = TRUE),
label = ~paste(LA_name, ":", Percentage, "%"), # This will create tooltips showing the info
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "12px", direction = "auto") # Adjust text size as needed
) %>%
addLegend(pal = color_palette, values = ~Percentage, opacity = 0.7, title = "Percentage", position = "topright")
# Render the map
uk_map